R. J. Serrano
05/12/2023
Learning objectives:
Stationarity and differencing
Non-seasonal ARIMA models
Estimation and order selection
ARIMA modeling in R
Forecasting
Seasonal ARIMA models
ARIMA vs ETS
ARIMA models provide another approach to time series forecasting. While exponential smoothing models are based on a description of the trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data.
A stationary time series is one whose statistical properties do not depend on the time at which the series is observed.
Thus, a stationary time series exhibits the following characteristics:
Roughly horizontal (no trend)
Constant variance
No long-term predictable patterns
Source: What is stationarity? https://youtu.be/aIdTGKjQWjA
Figure 9.1: Which of these series are stationary?
In Figure 9.1, note that the Google stock price was non-stationary in panel (a), but the daily changes were stationary in panel (b). This shows one way to make a non-stationary time series stationary — compute the differences between consecutive observations. This is known as differencing.
google_2018 <- gafa_stock |>
  filter(Symbol == "GOOG", year(Date) == 2018)

# original time series
google_2018 |>
  autoplot(Adj_Close) +
  labs(title = 'Google Closing Stock Price ($USD)')

# ACF of the daily changes
google_2018 |>
  ACF(difference(Close)) |>
  autoplot() +
  labs(subtitle = "Changes in Google closing stock price")

The ACF of the differenced Google stock price looks just like that of a white noise series. Since the Ljung-Box p-value is greater than 0.05, this suggests that the daily change in the Google stock price is essentially a random amount, uncorrelated with that of previous days.
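The test behind the quoted p-value is not shown in these notes; a minimal sketch using the Ljung-Box feature from the fpp3 workflow (the lag of 10 is an assumption, not taken from the notes):

```r
# Ljung-Box test on the differenced closing price
# (assumes the fpp3 packages are loaded and google_2018 exists as above)
google_2018 |>
  features(difference(Close), ljung_box, lag = 10)
```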
A time series is said to follow a random walk process if the predicted value of the series in one period is equivalent to the value of the series in the previous period plus a random error. Source: https://analystprep.com/study-notes/cfa-level-2/random-walk-process/
If the differenced series is white noise with zero mean, the model for the original series can be written as \[ y_t = y_{t-1} + \varepsilon_t, \] where \(\varepsilon_t \sim NID(0,\sigma^2)\); that is, the series follows a random walk.
Random walk models are widely used for non-stationary data, particularly financial and economic data. Random walks typically have:
long periods of apparent trends up or down
sudden and unpredictable changes in direction
A seasonal difference is the difference between an observation and the corresponding observation from the previous year: \[ y'_t = y_t - y_{t-m}, \] where \(m\) is the number of seasons per year.

Example: Antidiabetic drug sales
a10 <- PBS |>
filter(ATC2 == "A10") |>
summarise(Cost = sum(Cost) / 1e6)
a10 |>
  autoplot(Cost) +
  labs(title = 'Australian monthly scripts for A10 (antidiabetic) drugs sold')

a10 |>
  autoplot(log(Cost)) +
  labs(title = 'Australian monthly scripts for A10 (antidiabetic) drugs sold (log scale)')

a10 |>
  autoplot(log(Cost) |> difference(12)) +
  labs(title = 'Annual change in monthly scripts for A10 (antidiabetic) drugs sold')

One way to determine more objectively whether differencing is required is to use a unit root test, a statistical hypothesis test of stationarity.
Source: textbook Chapter 9.1
For example, let us apply it to the Google stock price data.
The p-value is less than 0.05, so the null hypothesis (the time series is stationary) is rejected in favor of the alternative hypothesis (the time series is NOT stationary).
We can difference the time series and apply the test again.
This time, the p-value is greater than 0.05, so we can conclude that the time series appears stationary.
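The code behind these two test results is not shown in the notes; a minimal sketch using the KPSS unit root feature from the fpp3 workflow (assuming `google_2018` from above):

```r
# KPSS unit root test on the raw closing price
# (null hypothesis: the series is stationary)
google_2018 |>
  features(Close, unitroot_kpss)

# Repeat the test on the differenced series;
# a p-value above 0.05 suggests the differenced series is stationary
google_2018 |>
  features(difference(Close), unitroot_kpss)
```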
Exercise:
For the tourism dataset, compute the total number of
trips and find an appropriate differencing (after transformation if
necessary) to obtain stationary data.
A very useful notational device is the backward shift operator, \(B\), which is used as follows: \[ B y_{t} = y_{t - 1} \]
In other words, \(B\), operating on \(y_{t}\), has the effect of shifting the data back one period.
Two applications of \(B\) to \(y_{t}\) shifts the data back two periods: \[ B(By_{t}) = B^{2}y_{t} = y_{t-2} \]
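The backward shift operator gives a compact notation for differencing. A first difference and a seasonal difference can be written as:

\[ y'_t = y_t - y_{t-1} = (1 - B)y_t, \qquad y_t - y_{t-m} = (1 - B^m)y_t. \]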
In a multiple regression model, introduced in Chapter 7, we forecast the variable of interest using a linear combination of predictors. In an autoregression model, we forecast the variable of interest using a linear combination of past values of the variable. The term autoregression indicates that it is a regression of the variable against itself.
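Written out, an autoregressive model of order \(p\), AR(\(p\)), takes the form

\[ y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t, \]

where \(\varepsilon_t\) is white noise. Analogously, a moving average model of order \(q\), MA(\(q\)), uses past forecast errors rather than past values:

\[ y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}. \]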
If we combine differencing with autoregression and a moving average model, we obtain a non-seasonal ARIMA model. ARIMA is an acronym for AutoRegressive Integrated Moving Average (in this context, “integration” is the reverse of differencing).
The full model can be written as: \[ y'_t = c + \phi_1 y'_{t-1} + \cdots + \phi_p y'_{t-p} + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} + \varepsilon_t, \] where \(y'_t\) is the differenced series (it may have been differenced more than once).
Example: Egyptian exports
global_economy |>
  filter(Code == "EGY") |>
  autoplot(Exports) +
  labs(y = "% of GDP", title = "Egyptian exports")

The following R code selects a non-seasonal ARIMA model automatically.
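The call itself is missing from these notes; a sketch of what would produce the report below (assuming the fpp3 packages are loaded):

```r
# Let ARIMA() choose the orders automatically
fit <- global_economy |>
  filter(Code == "EGY") |>
  model(ARIMA(Exports))
report(fit)
```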
## Series: Exports
## Model: ARIMA(2,0,1) w/ mean
##
## Coefficients:
## ar1 ar2 ma1 constant
## 1.6764 -0.8034 -0.6896 2.5623
## s.e. 0.1111 0.0928 0.1492 0.1161
##
## sigma^2 estimated as 8.046: log likelihood=-141.57
## AIC=293.13 AICc=294.29 BIC=303.43
Fitted model residual plots
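The plotting call is not shown; a sketch (assuming `fit` holds the model above):

```r
# Residual time plot, ACF, and histogram in one figure
fit |> gg_tsresiduals()
```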
Forecast for 10 years (periods)
fit |>
  forecast(h = 10) |>
  autoplot(global_economy) +
  labs(y = "% of GDP", title = "Egyptian exports")

It is usually not possible to tell, simply from a time plot, what values of \(p\) and \(q\) are appropriate for the data. However, it is sometimes possible to use the ACF plot, and the closely related PACF plot, to determine appropriate values for \(p\) and \(q\).
We can detect a decaying sinusoidal pattern in the ACF, and the PACF shows the last significant spike at lag 4. This is what you would expect from an ARIMA(4,0,0) model.
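The code producing the report below is missing from the notes; a plausible sketch (the name `fit2` is an assumption):

```r
# Fit the ARIMA(4,0,0) suggested by the ACF/PACF
fit2 <- global_economy |>
  filter(Code == "EGY") |>
  model(ARIMA(Exports ~ pdq(4, 0, 0)))
report(fit2)
```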
## Series: Exports
## Model: ARIMA(4,0,0) w/ mean
##
## Coefficients:
## ar1 ar2 ar3 ar4 constant
## 0.9861 -0.1715 0.1807 -0.3283 6.6922
## s.e. 0.1247 0.1865 0.1865 0.1273 0.3562
##
## sigma^2 estimated as 7.885: log likelihood=-140.53
## AIC=293.05 AICc=294.7 BIC=305.41
This model is only slightly worse than the ARIMA(2,0,1) model
identified by ARIMA() (with an AICc value of 294.70
compared to 294.29).
The ARIMA() function in the fable package uses a variation of the Hyndman-Khandakar algorithm (Hyndman & Khandakar, 2008), which combines unit root tests, minimisation of the AICc, and MLE to obtain an ARIMA model. The arguments to ARIMA() provide for many variations on the algorithm. What is described here is the default behaviour.
When fitting an ARIMA model to a set of (non-seasonal) time series data, the following procedure provides a useful general approach.
Source: Figure 9.12
With ARIMA models, more accurate portmanteau tests are obtained if the degrees of freedom of the test statistic are adjusted to take account of the number of parameters in the model. Specifically, we use \(\ell - K\) degrees of freedom in the test, where \(K\) is the number of AR and MA parameters in the model. So for the non-seasonal models we have considered so far, \(K = p + q\). The value of \(K\) is passed to the ljung_box function via the argument dof, as shown in the example below.
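The referenced example is missing from these notes; a sketch for the Egyptian exports ARIMA(2,0,1) model, where \(K = p + q = 3\) (assuming `fit` holds that model; the lag of 10 is an assumption):

```r
# Ljung-Box test on the innovation residuals,
# with degrees of freedom adjusted for the 3 estimated AR/MA parameters
augment(fit) |>
  features(.innov, ljung_box, lag = 10, dof = 3)
```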
Example: Central African Republic exports
global_economy |>
  filter(Code == "CAF") |>
  autoplot(Exports) +
  labs(title = "Central African Republic exports",
       y = "% of GDP")

The time plot shows some non-stationarity, with an overall decline. The improvement in 1994 was due to a new government which overthrew the military junta and had some initial success, before unrest caused further economic decline.
There is no evidence of changing variance, so we will not do a Box-Cox transformation.
To address the non-stationarity, we will take a first difference of the data.
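The figure the next paragraph refers to is not shown; a sketch producing the time plot, ACF, and PACF of the differenced series:

```r
# Display the differenced series with its ACF and PACF
global_economy |>
  filter(Code == "CAF") |>
  gg_tsdisplay(difference(Exports), plot_type = "partial")
```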
The PACF shown in the above figure is suggestive of an AR(2) model; so an initial candidate model is an ARIMA(2,1,0). The ACF suggests an MA(3) model; so an alternative candidate is an ARIMA(0,1,3).
We fit both an ARIMA(2,1,0) and an ARIMA(0,1,3) model along with two automated model selections, one using the default stepwise procedure, and one working harder to search a larger model space.
caf_fit <- global_economy |>
  filter(Code == "CAF") |>
  model(arima210 = ARIMA(Exports ~ pdq(2,1,0)),
        arima013 = ARIMA(Exports ~ pdq(0,1,3)),
        stepwise = ARIMA(Exports),
        search = ARIMA(Exports, stepwise = FALSE))

caf_fit |> pivot_longer(!Country, names_to = "Model name",
                        values_to = "Orders")

The four models have almost identical AICc values. Of the models fitted, the full search has found that an ARIMA(3,1,0) gives the lowest AICc value, closely followed by the ARIMA(2,1,0) and ARIMA(0,1,3), the latter two being the models that we guessed from the ACF and PACF plots. The automated stepwise selection has identified an ARIMA(2,1,2) model, which has the highest AICc value of the four models.
A portmanteau test (setting K=3) returns a large p-value, also suggesting that the residuals are white noise.
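The test itself is not shown; a sketch for the `search` model (ARIMA(3,1,0), so \(K = 3\); the choice of that model and the lag of 10 are assumptions):

```r
# Ljung-Box test on the residuals of the full-search model
augment(caf_fit) |>
  filter(.model == "search") |>
  features(.innov, ljung_box, lag = 10, dof = 3)
```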
Forecast for 5 years (periods)
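The forecasting code is missing here; a sketch (assuming `caf_fit` from above; plotting only the `search` model is an assumption):

```r
# Forecast 5 years ahead and plot against the historical data
caf_fit |>
  forecast(h = 5) |>
  filter(.model == "search") |>
  autoplot(global_economy)
```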
Note that the mean forecasts look very similar to what we would get with a random walk (equivalent to an ARIMA(0,1,0)). The extra work to include AR and MA terms has made little difference to the point forecasts in this example, although the prediction intervals are much narrower than for a random walk model.